Hi IODS! I am eager to learn about open data, open science and data science. I expect to improve my skills in data analysis and interpretation, and I also hope to learn new things along the way. This course was highly recommended by a colleague who attended it in the past, so I am excited and ready to learn.
date()
## [1] "Fri Dec 2 13:44:03 2022"
I am already familiar with R, but I found some information and suggestions in the book that might be useful. I usually save the R files and folders for a particular project in a designated project folder, but I have never used an RStudio project. It might be worth trying one to see how it compares to my approach. I am also familiar with most of the contents of chapters 1-4, but it was still fun to run the code and find ways to improve my own code and plots.
date()
## [1] "Fri Dec 2 13:44:03 2022"
Describe the work you have done this week and summarize your learning.
date()
## [1] "Fri Dec 2 13:44:03 2022"
This week I studied chapter 7, "Linear regression", from the book R for Health Data Science, available here: https://argoshare.is.ed.ac.uk/healthyr_book/chap07-h1.html. I am also learning data wrangling, i.e. how to manage large data tables and derive small, manageable working tables from them. After that I move on to regression analysis and model validation with univariate and multivariate regression models.
Data wrangling (max 5 points)
# The code for this exercise is available inside the data folder along with the file created
#create_learning2014.R
#assignment2_regression_and_model_validation_data_wrangling.csv
Analysis (max 15 points)
Read the students2014 data into R either from your local folder (if you completed the Data wrangling part) or from this url: https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt . (The separator is a comma “,” and the file includes a header). Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it. There is information related to the data here. (0-2 points)
This dataset comes from a survey conducted during the study "The relationship between learning approaches and students' achievements in an introductory statistics course in Finland, 2015". The results of the study are available here: https://www.slideshare.net/kimmovehkalahti/the-relationship-between-learning-approaches-and-students-achievements-in-an-introductory-statistics-course-in-finland
learning2014 <- read.csv("data/assignment2_regression_and_model_validation_data_wrangling.csv")
dim(learning2014)
## [1] 166 7
The dataset has 166 rows and 7 columns. I am using a shorter version of the dataset produced by data wrangling. Here is the structure of the dataset:
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
Each column is described below:

- gender: categorical variable with levels Male (M) and Female (F).
- Age: age in years, derived from the date of birth.
- Attitude: numerical variable representing global attitude toward statistics.
- deep: numerical variable representing the deep approach, calculated as the mean of several survey items about seeking meaning, relating ideas and use of evidence.
- stra: numerical variable representing the strategic approach, calculated as the mean of several survey items about organized studying and time management.
- surf: numerical variable representing the surface approach, calculated as the mean of several survey items about lack of purpose, unrelated memorising and syllabus-boundness.
- Points: integer variable representing exam points.
Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-3 points)
Here is a summary table of the above dataset:
summary(learning2014)
## gender Age Attitude deep
## Length:166 Min. :17.00 Min. :1.400 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333
## Mode :character Median :22.00 Median :3.200 Median :3.667
## Mean :25.51 Mean :3.143 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083
## Max. :55.00 Max. :5.000 Max. :4.917
## stra surf Points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
In this study, female participants were almost twice as numerous as male participants. Based on the summary, the deep, surface and strategic scores and the exam points are roughly normally distributed, as their mean and median values are very close. The age distribution is skewed to the right: most participants are young (interquartile range 21-27 years), with a few older students (maximum 55 years).
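These observations can be checked numerically as well. A minimal sketch, assuming the learning2014 data frame from above is loaded:

```r
# Compare mean and median to gauge skewness: a mean well above the
# median (as with Age here) suggests a right-skewed distribution.
sapply(learning2014[, c("Age", "deep", "stra", "surf", "Points")],
       function(x) c(mean = mean(x), median = median(x)))

# Count participants by gender
table(learning2014$gender)
```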
Here is a graphical overview from the dataset for better understanding of the relationship of available variables:
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
# Draw the plot
p <- ggpairs(learning2014, mapping = aes(col = gender),
lower = list(combo = wrap("facethist", bins = 20))) + theme_bw()
p
Male students tend to report a more positive attitude towards statistics. There is almost no correlation between age and the other variables. Among male students, the surface approach shows a significant negative correlation with both their attitude and their responses to the deep approach questions. There is a significant positive correlation between attitude and exam points for both genders. Exam points also show a negative, though non-significant, correlation with the deep and surface approaches in both genders.
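The pairwise relationships visible in the ggpairs plot can also be computed directly. A sketch, assuming learning2014 is loaded as above:

```r
# Correlation between attitude and exam points (positive in this data)
cor(learning2014$Attitude, learning2014$Points)

# Full correlation matrix of the numeric variables, rounded for readability
round(cor(learning2014[, c("Age", "Attitude", "deep", "stra",
                           "surf", "Points")]), 2)
```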
Choose three variables as explanatory variables and fit a regression model where exam points is the target (dependent, outcome) variable. Show a summary of the fitted model and comment and interpret the results. Explain and interpret the statistical test related to the model parameters. If an explanatory variable in your model does not have a statistically significant relationship with the target variable, remove the variable from the model and fit the model again without it. (0-4 points)
# fit a linear model
my_model <- lm(Points ~ Attitude + stra+ surf, data = learning2014)
# print out a summary of the model
summary(my_model)
##
## Call:
## lm(formula = Points ~ Attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## Attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
In this multivariable model, the only explanatory variable with a statistically significant effect on the student's final exam result is Attitude (p < 0.001). There is no significant relationship between the strategic or surface approach and the points scored. The multiple R-squared of 0.2074 implies that the model explains about 20%, or one fifth, of the variation in the outcome.
# fit a linear model with the variable that has a statistically significant relationship with Points, i.e. Attitude
my_model <- lm(Points ~ Attitude, data = learning2014)
# print out a summary of the model
summary(my_model)
##
## Call:
## lm(formula = Points ~ Attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## Attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Using a summary of your fitted model, explain the relationship between the chosen explanatory variables and the target variable (interpret the model parameters). Explain and interpret the multiple R-squared of the model. (0-3 points)
The target variable Points has a statistically significant relationship with the explanatory variable Attitude (p < 0.001). At a (theoretical) attitude value of 0, the predicted exam score would be 11.64 points, with a standard error of 1.83. For every one-point increase in attitude, the predicted exam score increases by 3.53 points, with a standard error of 0.57.
R-squared is always between 0 and 1: 0 (0%) indicates that the model explains none of the variability of the response data around its mean, and 1 (100%) indicates that it explains all of it. Multiple R-squared is used to evaluate how well the model fits the data. In this case, the multiple R-squared of 0.1906 implies that the model explains only about 19% of the variation in the outcome.
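Multiple R-squared can also be computed by hand from the residual and total sums of squares, which makes the "share of explained variation" interpretation concrete. A sketch, assuming the fitted my_model from above:

```r
# R^2 = 1 - RSS/TSS: the share of the variance in Points
# around its mean that the model accounts for
rss <- sum(residuals(my_model)^2)
tss <- sum((learning2014$Points - mean(learning2014$Points))^2)
1 - rss / tss  # should match the Multiple R-squared from summary()
```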
This is how the model looks:
library(ggplot2)
qplot(Attitude, Points, data = learning2014) + geom_smooth(method = "lm") + ggtitle("Student's attitude versus exam points")
## `geom_smooth()` using formula 'y ~ x'
Produce the following diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage. Explain the assumptions of the model and interpret the validity of those assumptions based on the diagnostic plots. (0-3 points)
#R makes it easy to graphically explore the validity of your model assumptions. If you give a linear model object as the first argument to the `plot()` function, the function automatically assumes you want diagnostic plots and will produce them. You can check the help page of plotting an lm object by typing `?plot.lm` or `help(plot.lm)` to the R console.
#In the plot function you can then use the argument `which` to choose which plots you want. `which` must be an integer vector corresponding to the following list of plots:
#which | graphic
#----- | --------
#1 | Residuals vs Fitted values
#2 | Normal QQ-plot
#3 | Standardized residuals vs Fitted values
#4 | Cook's distances
#5 | Residuals vs Leverage
#6 | Cook's distance vs Leverage
# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
#Residuals vs Fitted values
plot(my_model, which = 1)
The Residuals vs Fitted plot examines the variance of the residuals. A residual is the difference between an observed value and the value estimated by the model. Linear regression assumes homoscedasticity, i.e. equal (constant) variance of the residuals across the range of fitted values. In this model the residuals mostly fall within about 10 points (positive or negative) of the fitted values and show no clear pattern, which supports this assumption. There are three exceptional observations: 145, 56 and 35.
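The same diagnostic can be built manually with base graphics, which makes explicit what is on each axis. A sketch, assuming my_model from above:

```r
# Residual = observed Points - fitted Points, plotted against fitted values
plot(fitted(my_model), residuals(my_model),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # residuals should scatter evenly around zero
```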
#Normal QQ-plot
plot(my_model, which = 2)
A Q-Q (quantile-quantile) plot is a probability plot used to compare two distributions by plotting their quantiles against each other. Here, the standardized residuals of the model are compared against a theoretical normal distribution. The residuals are approximately normally distributed, with some poorly fitting observations at the extremes of the distribution; these are expected to have very little impact on the model. Observations 145, 56 and 35 again stand out as poorly fitted by the model.
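The same comparison can be drawn directly from the standardized residuals. A sketch, assuming my_model from above:

```r
# Q-Q plot of standardized residuals against theoretical normal quantiles
qqnorm(rstandard(my_model))
qqline(rstandard(my_model))  # points near the line support the normality assumption
```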
#Residuals vs Leverage
plot(my_model, which = 5)
Here we plot the standardized residuals against the leverage of each observation. Leverage measures how far the explanatory-variable values of an observation are from those of the other observations, so this plot shows the impact of every single observation on the fitted model. A few observations have a high influence on the model, suggesting that parts of the fit are driven by a small number of data points. Observations 56 and 35, together with a new observation 71, are influential, while 145 does not have very high leverage.
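Leverage and influence can also be inspected numerically. A sketch, assuming my_model from above:

```r
# Leverage of each observation: diagonal of the hat matrix
lev <- hatvalues(my_model)
head(sort(lev, decreasing = TRUE))  # observations with the highest leverage

# Cook's distance combines leverage and residual size into overall influence
head(sort(cooks.distance(my_model), decreasing = TRUE))
```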
After completing all the phases above you are ready to submit your assignment for review (using the Moodle Workshop below). Have the two links (your GitHub repository and your course diary) ready!
The link to my GitHub repository: https://github.com/adhisadi/IODS-project The link to my course diary: https://adhisadi.github.io/IODS-project/
date()
## [1] "Fri Dec 2 13:44:13 2022"
Read the joined student alcohol consumption data into R either from your local folder (if you completed the Data wrangling part) or from this url (in case you got stuck with the Data wrangling part) https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv (In the above linked file, the column separator is a comma and the first row includes the column names). Print out the names of the variables in the data and describe the data set briefly, assuming the reader has no previous knowledge of it. There is information related to the data here. (0-1 point)
I created the file in the data wrangling part, but here I read the data both from my local file and from the online source, just to check that they match.
library(readr)
setwd("/Users/admin_adhisadi/workspace/School_work/IODS-project")
alc <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv", show_col_types=FALSE)
alc2 <- read.csv("data/math_por.csv", sep = ",", header=TRUE)
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
colnames(alc2)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
str(alc)
## spec_tbl_df [370 × 35] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ school : chr [1:370] "GP" "GP" "GP" "GP" ...
## $ sex : chr [1:370] "F" "F" "F" "F" ...
## $ age : num [1:370] 18 17 15 15 16 16 16 17 15 15 ...
## $ address : chr [1:370] "U" "U" "U" "U" ...
## $ famsize : chr [1:370] "GT3" "GT3" "LE3" "GT3" ...
## $ Pstatus : chr [1:370] "A" "T" "T" "T" ...
## $ Medu : num [1:370] 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : num [1:370] 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : chr [1:370] "at_home" "at_home" "at_home" "health" ...
## $ Fjob : chr [1:370] "teacher" "other" "other" "services" ...
## $ reason : chr [1:370] "course" "course" "other" "home" ...
## $ guardian : chr [1:370] "mother" "father" "mother" "mother" ...
## $ traveltime: num [1:370] 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : num [1:370] 2 2 2 3 2 2 2 2 2 2 ...
## $ schoolsup : chr [1:370] "yes" "no" "yes" "no" ...
## $ famsup : chr [1:370] "no" "yes" "no" "yes" ...
## $ activities: chr [1:370] "no" "no" "no" "yes" ...
## $ nursery : chr [1:370] "yes" "no" "yes" "yes" ...
## $ higher : chr [1:370] "yes" "yes" "yes" "yes" ...
## $ internet : chr [1:370] "no" "yes" "yes" "yes" ...
## $ romantic : chr [1:370] "no" "no" "no" "yes" ...
## $ famrel : num [1:370] 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : num [1:370] 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : num [1:370] 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : num [1:370] 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : num [1:370] 1 1 3 1 2 2 1 1 1 1 ...
## $ health : num [1:370] 3 3 3 5 5 5 3 1 1 5 ...
## $ failures : num [1:370] 0 0 2 0 0 0 0 0 0 0 ...
## $ paid : chr [1:370] "no" "no" "yes" "yes" ...
## $ absences : num [1:370] 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : num [1:370] 2 7 10 14 8 14 12 8 16 13 ...
## $ G2 : num [1:370] 8 8 10 14 12 14 12 9 17 14 ...
## $ G3 : num [1:370] 8 8 11 14 12 14 12 10 18 14 ...
## $ alc_use : num [1:370] 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi [1:370] FALSE FALSE TRUE FALSE FALSE FALSE ...
## - attr(*, "spec")=
## .. cols(
## .. school = col_character(),
## .. sex = col_character(),
## .. age = col_double(),
## .. address = col_character(),
## .. famsize = col_character(),
## .. Pstatus = col_character(),
## .. Medu = col_double(),
## .. Fedu = col_double(),
## .. Mjob = col_character(),
## .. Fjob = col_character(),
## .. reason = col_character(),
## .. guardian = col_character(),
## .. traveltime = col_double(),
## .. studytime = col_double(),
## .. schoolsup = col_character(),
## .. famsup = col_character(),
## .. activities = col_character(),
## .. nursery = col_character(),
## .. higher = col_character(),
## .. internet = col_character(),
## .. romantic = col_character(),
## .. famrel = col_double(),
## .. freetime = col_double(),
## .. goout = col_double(),
## .. Dalc = col_double(),
## .. Walc = col_double(),
## .. health = col_double(),
## .. failures = col_double(),
## .. paid = col_character(),
## .. absences = col_double(),
## .. G1 = col_double(),
## .. G2 = col_double(),
## .. G3 = col_double(),
## .. alc_use = col_double(),
## .. high_use = col_logical()
## .. )
## - attr(*, "problems")=<externalptr>
str(alc2)
## 'data.frame': 370 obs. of 35 variables:
## $ school : chr "GP" "GP" "GP" "GP" ...
## $ sex : chr "F" "F" "F" "F" ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : chr "U" "U" "U" "U" ...
## $ famsize : chr "GT3" "GT3" "LE3" "GT3" ...
## $ Pstatus : chr "A" "T" "T" "T" ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : chr "at_home" "at_home" "at_home" "health" ...
## $ Fjob : chr "teacher" "other" "other" "services" ...
## $ reason : chr "course" "course" "other" "home" ...
## $ guardian : chr "mother" "father" "mother" "mother" ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ schoolsup : chr "yes" "no" "yes" "no" ...
## $ famsup : chr "no" "yes" "no" "yes" ...
## $ activities: chr "no" "no" "no" "yes" ...
## $ nursery : chr "yes" "no" "yes" "yes" ...
## $ higher : chr "yes" "yes" "yes" "yes" ...
## $ internet : chr "no" "yes" "yes" "yes" ...
## $ romantic : chr "no" "no" "no" "yes" ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ failures : int 0 0 2 0 0 0 0 0 0 0 ...
## $ paid : chr "no" "no" "yes" "yes" ...
## $ absences : int 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : int 2 7 10 14 8 14 12 8 16 13 ...
## $ G2 : int 8 8 10 14 12 14 12 9 17 14 ...
## $ G3 : int 8 8 11 14 12 14 12 10 18 14 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
# Analysis done using one of the datasets
Student performance dataset:
In week 3 we are working with a new dataset on student performance. The data describe student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social and school-related features, and were collected using school reports and questionnaires. Two datasets are provided, covering performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modelled under binary/five-level classification and regression tasks. More information is available here: https://archive.ics.uci.edu/ml/datasets/Student+Performance. The 'alc_use' column is the average of weekday and weekend alcohol consumption (Dalc and Walc), and the 'high_use' column was created from 'alc_use': it is TRUE for students whose 'alc_use' is greater than 2 (and FALSE otherwise).
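The two derived columns described above were created in the data wrangling step roughly like this (a sketch with dplyr; the exact wrangling script may differ):

```r
library(dplyr)

# alc_use: mean of weekday (Dalc) and weekend (Walc) alcohol consumption
# high_use: TRUE when average consumption exceeds 2
alc <- alc %>%
  mutate(alc_use = (Dalc + Walc) / 2,
         high_use = alc_use > 2)
```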
The purpose of your analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data. To do this, choose 4 interesting variables in the data and for each of them, present your personal hypothesis about their relationships with alcohol consumption. (0-1 point)
With this dataset we are essentially studying the relationship between students' behaviour and alcohol consumption. I chose these four variables: famsize, Pstatus, activities and absences. Descriptions of the columns and my hypotheses are below; my hypotheses mostly come from stereotypes.

famsize - family size (binary: 'LE3' - less or equal to 3, or 'GT3' - greater than 3). A big family could mean an extended family with uncles, aunts and grandparents, or more children in the family, depending on the culture. I assume here that a big family means more children. My hypothesis is therefore that alcohol consumption increases with family size, possibly because it is harder for the parents to meet the financial and emotional needs of each child.
Pstatus - parent's cohabitation status (binary: 'T' - living together, or 'A' - apart). My hypothesis is that students raised by a single parent have increased alcohol consumption, for similar reasons as above: difficulty in meeting the financial and emotional needs of the child.
activities - extra-curricular activities (binary: yes or no). More extra-curricular activities mean an active, busy life and more social support, so my hypothesis is that alcohol consumption is less likely.
absences - number of school absences (numeric: 0 to 93). A high number of school absences may indicate several problems in the student's life, so my hypothesis is that alcohol consumption is more likely. In addition, all of the above factors (larger family size, parents living apart and few extra-curricular activities) are likely to contribute to more absences along with increased alcohol consumption.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(gmodels)
library(ggplot2)
CrossTable(alc$high_use, alc$famsize, dnn = c("High alcohol consumption", "Family size"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 370
##
##
## | Family size
## High alcohol consumption | GT3 | LE3 | Row Total |
## -------------------------|-----------|-----------|-----------|
## FALSE | 192 | 67 | 259 |
## | 0.181 | 0.462 | |
## | 0.741 | 0.259 | 0.700 |
## | 0.722 | 0.644 | |
## | 0.519 | 0.181 | |
## -------------------------|-----------|-----------|-----------|
## TRUE | 74 | 37 | 111 |
## | 0.422 | 1.078 | |
## | 0.667 | 0.333 | 0.300 |
## | 0.278 | 0.356 | |
## | 0.200 | 0.100 | |
## -------------------------|-----------|-----------|-----------|
## Column Total | 266 | 104 | 370 |
## | 0.719 | 0.281 | |
## -------------------------|-----------|-----------|-----------|
##
##
p1 <- alc %>%
ggplot(aes(x = famsize, fill = high_use)) +
geom_bar()
p2 <- alc %>%
ggplot(aes(x = famsize, fill = high_use)) +
geom_bar(position = "fill") +
ylab("proportion")
library(patchwork)
p1 + p2
Large family size does not appear to be associated with high alcohol consumption.
CrossTable(alc$high_use, alc$Pstatus, dnn = c("High alcohol consumption", "Parental status"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 370
##
##
## | Parental status
## High alcohol consumption | A | T | Row Total |
## -------------------------|-----------|-----------|-----------|
## FALSE | 26 | 233 | 259 |
## | 0.014 | 0.002 | |
## | 0.100 | 0.900 | 0.700 |
## | 0.684 | 0.702 | |
## | 0.070 | 0.630 | |
## -------------------------|-----------|-----------|-----------|
## TRUE | 12 | 99 | 111 |
## | 0.032 | 0.004 | |
## | 0.108 | 0.892 | 0.300 |
## | 0.316 | 0.298 | |
## | 0.032 | 0.268 | |
## -------------------------|-----------|-----------|-----------|
## Column Total | 38 | 332 | 370 |
## | 0.103 | 0.897 | |
## -------------------------|-----------|-----------|-----------|
##
##
p1 <- alc %>%
ggplot(aes(x = Pstatus, fill = high_use)) +
geom_bar()
p2 <- alc %>%
ggplot(aes(x = Pstatus, fill = high_use)) +
geom_bar(position = "fill") +
ylab("proportion")
p1 + p2
Parent’s cohabitation status does not seem to contribute to alcohol consumption.
CrossTable(alc$high_use, alc$activities, dnn = c("High alcohol consumption", "Extra-curricular activities"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 370
##
##
## | Extra-curricular activities
## High alcohol consumption | no | yes | Row Total |
## -------------------------|-----------|-----------|-----------|
## FALSE | 120 | 139 | 259 |
## | 0.224 | 0.210 | |
## | 0.463 | 0.537 | 0.700 |
## | 0.670 | 0.728 | |
## | 0.324 | 0.376 | |
## -------------------------|-----------|-----------|-----------|
## TRUE | 59 | 52 | 111 |
## | 0.523 | 0.490 | |
## | 0.532 | 0.468 | 0.300 |
## | 0.330 | 0.272 | |
## | 0.159 | 0.141 | |
## -------------------------|-----------|-----------|-----------|
## Column Total | 179 | 191 | 370 |
## | 0.484 | 0.516 | |
## -------------------------|-----------|-----------|-----------|
##
##
p1 <- alc %>%
ggplot(aes(x = activities, fill = high_use)) +
geom_bar()
p2 <- alc %>%
ggplot(aes(x = activities, fill = high_use)) +
geom_bar(position = "fill") +
ylab("proportion")
p1 + p2
Extra-curricular activities do not seem to affect alcohol consumption.
library(ggplot2)
#let's plot absences as it is numeric
p1 <- ggplot(alc, aes(x= absences, fill=high_use)) + geom_density(alpha=0.4)
p1+ theme_bw()
#model
m <- glm(high_use ~ (alc$famsize == "GT3") + (activities == "no") + (Pstatus == "A") + absences, data = alc, family = "binomial")
#odds ratio
OR <- coef(m) %>% exp
#confidence interval
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
#summary
summary(m)
##
## Call:
## glm(formula = high_use ~ (alc$famsize == "GT3") + (activities ==
## "no") + (Pstatus == "A") + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2960 -0.8113 -0.7217 1.2363 1.8858
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.1310 0.2708 -4.177 2.96e-05 ***
## alc$famsize == "GT3"TRUE -0.3660 0.2568 -1.425 0.15404
## activities == "no"TRUE 0.2847 0.2350 1.211 0.22575
## Pstatus == "A"TRUE -0.2759 0.4029 -0.685 0.49336
## absences 0.0900 0.0234 3.846 0.00012 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 430.89 on 365 degrees of freedom
## AIC: 440.89
##
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept) alc$famsize == "GT3"TRUE activities == "no"TRUE
## -1.13102023 -0.36599660 0.28471938
## Pstatus == "A"TRUE absences
## -0.27594749 0.08999505
#odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.3227039 0.1869931 0.5422024
## alc$famsize == "GT3"TRUE 0.6935052 0.4203051 1.1526892
## activities == "no"TRUE 1.3293889 0.8391496 2.1118619
## Pstatus == "A"TRUE 0.7588528 0.3313845 1.6273877
## absences 1.0941689 1.0473341 1.1480407
The results show that larger family size, parents living apart, and lack of extra-curricular activities are not likely to contribute to increased alcohol consumption: the confidence intervals of their odds ratios (ranging roughly from 0.33 to 2.11) all include 1, so these effects are not significant. For absences, OR = 1.09 (CI 1.05 to 1.15) means that each additional absence multiplies the odds of high alcohol use by about 1.09, i.e. a roughly 9% increase per absence. My hypothesis was not proven to be correct.
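As a sanity check on the odds ratios above: an odds ratio is simply the exponential of the logistic regression coefficient. A minimal sketch using the absences coefficient from the model summary (the value 0.0900 is copied from the output above):

```r
# A coefficient on the log-odds scale converts to an odds ratio by exponentiation;
# each one-unit increase in the predictor multiplies the odds by exp(beta).
beta_absences <- 0.0900             # coefficient of absences from the summary above
or_absences <- exp(beta_absences)   # multiplicative change in odds per extra absence
round(or_absences, 3)               # about 1.094, i.e. ~9% higher odds per absence
```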
# predictions versus actual values
# predict() the probabilty of high_use
probability <- predict(m, type = "response")
# add the predicted probability to 'alc'
alc <- mutate(alc, probability = probability)
# use probability to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5 )
# tabulate target variable versus predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>%
prop.table() %>%
addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.67837838 0.02162162 0.70000000
## TRUE 0.26756757 0.03243243 0.30000000
## Sum 0.94594595 0.05405405 1.00000000
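The training error can be read from the table above as the sum of the off-diagonal proportions. A self-contained sketch of the same computation with toy vectors (the real computation would use alc$high_use and alc$prediction):

```r
# Training error = proportion of observations where prediction != actual.
# Toy vectors stand in for alc$high_use and alc$prediction here.
actual     <- c(TRUE, TRUE, FALSE, FALSE, FALSE)
prediction <- c(TRUE, FALSE, FALSE, FALSE, TRUE)
mean(actual != prediction)   # 0.4: two of the five predictions are wrong
```

With the real data, mean(alc$high_use != alc$prediction) corresponds to the off-diagonal proportions in the table above, about 0.02 + 0.27 = 0.29.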
This shows that my hypothesis was poor; the common negative stereotypes I relied on in my hypothesis should be avoided. The only variable in my model that works is absences, as shown below:
# new model
m2 <- glm(high_use ~ absences, data = alc, family = "binomial")
# Odds ratios
OR2 <- coef(m2) %>% exp
# confidence intervals
CI2 <- confint(m2) %>% exp
## Waiting for profiling to be done...
#probability of high alcohol consumption
probability_2 <- predict(m2, type = "response")
# add predicted probability to 'alc'
alc <- mutate(alc, probability_2 = probability_2)
# use probability to make a prediction of high_use
alc <- mutate(alc, prediction_2 = probability_2 > 0.5 )
# Summary of the model
summary(m2)
##
## Call:
## glm(formula = high_use ~ absences, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3598 -0.8209 -0.7318 1.2658 1.7419
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.26945 0.16094 -7.888 3.08e-15 ***
## absences 0.08867 0.02317 3.827 0.00013 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 434.52 on 368 degrees of freedom
## AIC: 438.52
##
## Number of Fisher Scoring iterations: 4
# Coefficients of the model
coef(m2)
## (Intercept) absences
## -1.26944532 0.08866504
# Odds ratios with their confidence intervals
cbind(OR2, CI2)
## OR2 2.5 % 97.5 %
## (Intercept) 0.2809874 0.2030806 0.3819961
## absences 1.0927146 1.0464586 1.1459894
The results show that students who are more often absent from their classes are also more likely to consume alcohol in excess. Let's see how these two models compare with a likelihood ratio test:
## Analysis of Deviance Table
##
## Model 1: high_use ~ (alc$famsize == "GT3") + (activities == "no") + (Pstatus ==
## "A") + absences
## Model 2: high_use ~ absences
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 365 430.89
## 2 368 434.52 -3 -3.6311 0.3042
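The deviance table above was presumably produced with a call like anova(m, m2, test = "Chisq"). A self-contained sketch of the same kind of likelihood ratio test between nested logistic models, using the built-in mtcars data (the variables here are illustrative only):

```r
# Likelihood ratio test between nested logistic regression models.
full    <- glm(am ~ wt + hp, data = mtcars, family = "binomial")
reduced <- glm(am ~ wt,      data = mtcars, family = "binomial")
# anova() with test = "Chisq" compares the residual deviances of the two fits;
# Pr(>Chi) tells whether the extra term significantly improves the fit.
anova(full, reduced, test = "Chisq")
```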
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability_2, y = high_use, col= prediction_2)) + xlab("probability")
# define the geom as points and draw the plot
g + geom_point()
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction_2) %>%
prop.table() %>%
addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.68648649 0.01351351 0.70000000
## TRUE 0.27567568 0.02432432 0.30000000
## Sum 0.96216216 0.03783784 1.00000000
date()
## [1] "Fri Dec 2 13:44:16 2022"
Create a new R Markdown file and save it as an empty file named ‘chapter4.Rmd’. Then include the file as a child file in your ‘index.Rmd’ file. Perform the following analysis in the file you created. Load the Boston data from the MASS package. Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it. Details about the Boston dataset can be seen for example here. (0-1 points) https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html
library(dplyr)
library(ggplot2)
library(GGally)
library(reshape)
##
## Attaching package: 'reshape'
## The following object is masked from 'package:dplyr':
##
## rename
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:patchwork':
##
## area
## The following object is masked from 'package:dplyr':
##
## select
# load the data
data("Boston")
# explore the dataset
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
The Boston dataset describes housing values in the suburbs of Boston. It has 506 rows and 14 columns. Here is a description of each column:
crim: per capita crime rate by town.
zn: proportion of residential land zoned for lots over 25,000 sq.ft.
indus: proportion of non-retail business acres per town.
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox: nitrogen oxides concentration (parts per 10 million).
rm: average number of rooms per dwelling.
age: proportion of owner-occupied units built prior to 1940.
dis: weighted mean of distances to five Boston employment centres.
rad: index of accessibility to radial highways.
tax: full-value property-tax rate per $10,000.
ptratio: pupil-teacher ratio by town.
black: 1000(Bk − 0.63)^2, where Bk is the proportion of Black residents by town.
lstat: lower status of the population (percent).
medv: median value of owner-occupied homes in $1000s.
Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-2 points)
cor_matrix <- cor(Boston)
cor_matrix <- cor_matrix %>% round(digits = 2)
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor_matrix, method="circle")
library("psych") #for p-value table
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
cor_test_mat <- corr.test(cor_matrix)$p # Apply corr.test function
cor_test_mat
## crim zn indus chas nox
## crim 0.000000e+00 5.460004e-02 5.861585e-03 1.0000000 7.907814e-03
## zn 2.284457e-03 0.000000e+00 1.271309e-04 1.0000000 2.417195e-04
## indus 1.105960e-04 1.672775e-06 0.000000e+00 1.0000000 9.560598e-08
## chas 4.167137e-01 9.742994e-01 7.988462e-01 0.0000000 1.000000e+00
## nox 1.550552e-04 3.357216e-06 1.074225e-09 0.9200326 0.000000e+00
## rm 2.431550e-03 2.434360e-03 3.886776e-04 0.4966148 1.649605e-03
## age 6.401623e-04 3.007306e-07 1.054161e-07 0.9626125 5.329690e-09
## dis 5.756140e-04 2.863417e-07 5.617665e-08 0.9200861 2.056115e-09
## rad 1.066445e-06 2.865584e-04 1.712863e-06 0.5139848 4.791323e-06
## tax 2.762339e-06 1.722141e-04 1.424366e-07 0.5004301 1.058846e-06
## ptratio 1.738565e-03 8.983828e-04 7.449341e-04 0.2954829 4.565702e-03
## black 6.091967e-05 8.087550e-03 3.260073e-04 0.6344478 3.329302e-04
## lstat 5.858634e-05 7.173685e-05 8.898971e-07 0.5508139 4.162472e-06
## medv 1.706341e-04 8.170349e-04 4.791024e-05 0.3461764 2.671186e-04
## rm age dis rad tax
## crim 5.460004e-02 2.304584e-02 2.129772e-02 8.259000e-05 2.016508e-04
## zn 5.460004e-02 2.405844e-05 2.319368e-05 1.318169e-02 8.531706e-03
## indus 1.554711e-02 8.854952e-06 4.775015e-06 1.284648e-04 1.167980e-05
## chas 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
## nox 4.948816e-02 4.636831e-07 1.809382e-07 3.353926e-04 8.259000e-05
## rm 0.000000e+00 5.460004e-02 1.170387e-01 5.132770e-02 2.777919e-02
## age 2.275002e-03 0.000000e+00 8.982624e-08 4.066218e-03 1.841719e-03
## dis 6.665341e-03 9.980694e-10 0.000000e+00 2.689767e-03 1.433697e-03
## rad 1.974142e-03 7.530033e-05 4.482945e-05 0.000000e+00 6.702657e-11
## tax 8.187115e-04 2.877686e-05 2.166334e-05 7.365557e-13 0.000000e+00
## ptratio 2.925326e-04 4.236675e-03 6.502152e-03 3.354635e-04 2.702118e-04
## black 1.717361e-02 1.503185e-03 1.809994e-03 2.139846e-05 2.902442e-05
## lstat 2.721164e-06 7.283252e-06 6.071380e-05 2.338660e-05 5.374022e-06
## medv 1.140787e-07 4.559715e-04 2.010551e-03 1.193301e-04 3.677290e-05
## ptratio black lstat medv
## crim 4.948816e-02 0.0034606865 3.398008e-03 8.531706e-03
## zn 2.874825e-02 0.1264815268 3.945527e-03 2.777919e-02
## indus 2.607269e-02 0.0142572745 7.030187e-05 2.826704e-03
## chas 1.000000e+00 1.0000000000 1.000000e+00 1.000000e+00
## nox 8.674833e-02 0.0142572745 2.955355e-04 1.282169e-02
## rm 1.318169e-02 0.2404305720 2.013662e-04 9.468528e-06
## age 8.473349e-02 0.0465987475 4.952611e-04 1.778289e-02
## dis 1.170387e-01 0.0494881609 3.460687e-03 5.132770e-02
## rad 1.425727e-02 0.0014336967 1.520129e-03 6.205164e-03
## tax 1.282169e-02 0.0018417188 3.708075e-04 2.243147e-03
## ptratio 0.000000e+00 0.1264815268 1.425727e-02 1.841719e-03
## black 7.905095e-03 0.0000000000 1.812792e-02 4.948816e-02
## lstat 3.240290e-04 0.0004770506 0.000000e+00 1.908560e-06
## medv 2.901205e-05 0.0016797274 2.219256e-08 0.000000e+00
corrplot(cor_matrix,p.mat = cor_test_mat,insig = "p-value")
This plot shows the correlation matrix. The area of each circle reflects the absolute value of the corresponding correlation coefficient. Lighter shades mean weaker correlation (white is no correlation); darker blue means stronger positive correlation and darker red means stronger negative correlation. The numbers inside the cells show p-values, printed only in the cells where the correlation coefficient is not significant. We see both significant positive and negative correlations among several variables: dis correlates strongly negatively with indus, nox and age, as does lstat with medv, while the strongest positive correlation is between rad and tax.
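To back the visual reading with numbers, the strongest pairwise correlation can be pulled out of the matrix directly; a minimal sketch assuming only MASS's Boston data:

```r
library(MASS)
cm <- cor(Boston)
diag(cm) <- 0                                    # zero out the trivial self-correlations
max(abs(cm))                                     # largest absolute correlation, ~0.91
which(abs(cm) == max(abs(cm)), arr.ind = TRUE)   # points to the rad-tax pair
```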
# pair plot of the variables themselves (not of the correlation matrix)
p <- ggpairs(Boston, mapping = aes(),
lower = list(combo = wrap("facethist", bins = 20))) + theme_bw()
p
We see bimodal distributions for several variables. The variables zn, rm and dis appear skewed right, while black is skewed left.
# Distribution can be viewed also like this:
# Note this part added later for improvement (after the assignment deadline was over)
ggplot(data = melt(Boston), aes(x = value)) +
stat_density() +
facet_wrap(~variable, scales = "free")
## Using as id variables
Here is the summary of the dataset
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Standardize the dataset and print out summaries of the scaled data. How did the variables change? Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset. Divide the dataset to train and test sets, so that 80% of the data belongs to the train set. (0-2 points)
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
boston_scaled$crim <- as.numeric(boston_scaled$crim) #required for next step
# summary of the scaled crime rate
summary(Boston$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00632 0.08204 0.25651 3.61352 3.67708 88.97620
# create a categorical variable of the crime rate in the Boston dataset, Use the quantiles as the break points
# create a quantile vector of crim
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
labels = c("low", "med_low", "med_high", "high"))
table(crime)
## crime
##      low  med_low med_high     high
##      127      126      126      127
#Drop the old crime rate variable from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
boston_scaled$crime <- factor(boston_scaled$crime, levels = c("low", "med_low", "med_high", "high"))
#Divide the dataset to train and test sets, so that 80% of the data belongs to the train set.
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
By scaling, we have subtracted the column mean from each column and divided the difference by the standard deviation. \[scaled(x) = \frac{x - mean(x)}{ sd(x)}\] As a result, in the scaled output each variable now has mean 0; values below the original mean become negative (the minimum and 1st quartile are always negative) and values above it become positive (the 3rd quartile and maximum are positive).
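The formula can be checked directly against R's scale(); a minimal sketch on one Boston column:

```r
# Verify that scale() computes (x - mean(x)) / sd(x), as in the formula above.
library(MASS)
x <- Boston$crim
manual <- (x - mean(x)) / sd(x)            # subtract the mean, divide by the sd
all.equal(as.vector(scale(x)), manual)     # TRUE: scale() does exactly this
```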
Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot. (0-3 points)
# linear discriminant analysis on train set
#I reload boston_scaled from the course repository, because lda() somehow produced a warning with my own boston_scaled data frame (releveling the crime factor can yield NAs if cut() was run without a labels argument).
boston_scaled <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/boston_scaled.txt",sep=",", header = T)
boston_scaled$crime <- factor(boston_scaled$crime, levels = c("low", "med_low", "med_high", "high"))
ind <- sample(nrow(boston_scaled), size = nrow(boston_scaled) * 0.8)
train <- boston_scaled[ind,]
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2524752 0.2301980 0.2450495 0.2722772
##
## Group means:
## zn indus chas nox rm age
## low 1.02123632 -0.9072611 -0.15653200 -0.8872605 0.4315419 -0.8592762
## med_low -0.09881021 -0.2554344 -0.06065701 -0.5357652 -0.1342905 -0.2815545
## med_high -0.38242941 0.1319232 0.12535782 0.3053716 0.1602694 0.3189338
## high -0.48724019 1.0169558 -0.02178633 1.0686460 -0.4610138 0.8126151
## dis rad tax ptratio black lstat
## low 0.9021010 -0.6846209 -0.7340671 -0.4341205 0.38382727 -0.75061951
## med_low 0.3490730 -0.5570619 -0.4744850 -0.0430348 0.32100583 -0.13859703
## med_high -0.3529831 -0.4018374 -0.3315206 -0.2687348 0.07302245 -0.05396767
## high -0.8586216 1.6397657 1.5152267 0.7826832 -0.72481360 0.87629325
## medv
## low 0.51034154
## med_low -0.01292014
## med_high 0.22619348
## high -0.66375507
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.071863482 0.86242861 -0.84833398
## indus -0.019259668 -0.21520528 0.31066572
## chas -0.058390920 -0.06353660 0.09584073
## nox 0.444288434 -0.68013671 -1.44449341
## rm -0.116720317 -0.10308821 -0.14552478
## age 0.203697174 -0.16323830 0.04514572
## dis -0.095736488 -0.22202915 0.32210329
## rad 3.167080140 0.91161969 -0.01832820
## tax 0.005680403 -0.03881713 0.57208000
## ptratio 0.105576691 0.05973318 -0.22974536
## black -0.100341892 0.01610954 0.13227131
## lstat 0.225974255 -0.28091242 0.29854009
## medv 0.189728166 -0.40593418 -0.24607747
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9546 0.0350 0.0105
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric; leaving this as a factor or character vector gives an error
classes <- as.numeric(train$crime)
# Draw the LDA (bi)plot
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
Save the crime categories from the test set and then remove the categorical crime variable from the test dataset. Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results. (0-3 points)
test <- boston_scaled[-ind,]
# save the crime categories from the test set
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes , predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 12 12 1 0
## med_low 7 21 5 0
## med_high 0 8 18 1
## high 0 0 0 17
plot(table(correct = correct_classes , predicted = lda.pred$class))
#wrong predictions in the (training) data
mean(correct_classes != lda.pred$class)
## [1] 0.3333333
Here is the test error rate:
error <- mean(correct_classes != lda.pred$class)
errorrate <- round(error*100, digits = 2)
paste0("Test errorrate is ",errorrate, "%")
## [1] "Test errorrate is 33.33%"
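The same error rate can be recovered from the printed cross tabulation: it is one minus the share of the diagonal (correct) counts. A sketch using the counts copied from the table above (note that they depend on the random train/test split):

```r
# Confusion matrix from the cross tabulation above (rows = correct, cols = predicted).
conf <- matrix(c(12, 12,  1,  0,
                  7, 21,  5,  0,
                  0,  8, 18,  1,
                  0,  0,  0, 17),
               nrow = 4, byrow = TRUE)
error <- 1 - sum(diag(conf)) / sum(conf)   # off-diagonal share of all test cases
round(error * 100, 2)                      # 33.33, matching the rate printed above
```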
Reload the Boston dataset and standardize the dataset (we did not do this in the Exercise Set, but you should scale the variables to get comparable distances). Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results. (0-4 points)
library(MASS)
data("Boston")
boston_scaled <- scale(Boston)
dist_eu <- dist(boston_scaled)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
set.seed(123)
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line') + scale_x_continuous(breaks=seq(1, 10, 1))
The value of the total WCSS changes radically at two clusters, so two clusters would seem optimal.
# k-means clustering
km <- kmeans(boston_scaled, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
This is hard to interpret. Let's try different ways to look at the clusters: 1. by dividing the dataset to include only a few columns at a time, and 2. by plotting individual variable pairs separately (for now, only two pairs that seem correlated).
pairs(boston_scaled[, 1:7], col = km$cluster)
pairs(boston_scaled[, 8:14], col = km$cluster)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble 3.1.8 ✔ stringr 1.4.1
## ✔ tidyr 1.2.1 ✔ forcats 0.5.2
## ✔ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ psych::%+%() masks ggplot2::%+%()
## ✖ psych::alpha() masks ggplot2::alpha()
## ✖ tidyr::expand() masks reshape::expand()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ reshape::rename() masks dplyr::rename()
## ✖ MASS::select() masks dplyr::select()
clusters <- factor(km$cluster)
boston_scaled <- as.data.frame(boston_scaled)
boston_scaled %>% ggplot(aes(x = indus, y = nox, col = clusters)) +
geom_point()
boston_scaled <- as.data.frame(boston_scaled)
boston_scaled %>% ggplot(aes(x =lstat, y = medv, col = clusters)) +
geom_point()
#ggpairs(boston_scaled, mapping = aes(col = clusters), lower = list(combo = wrap("facethist", bins = 20))) + theme_bw()
#commented because it is still crowded figure.
We see a positive correlation between indus and nox in cluster 1 but no correlation in cluster 2. There is a negative correlation between lstat and medv in both clusters.
date()
## [1] "Fri Dec 2 13:44:45 2022"
The data wrangling part is completed and saved as create_human_assignment5.R in the data folder. The human data is saved as human.txt.
I load the data both from my own data wrangling output and from the course link, but use only one of them for the analysis.
human <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.txt",sep =",", header = T)
dim(human)
## [1] 155 8
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
human2 <- read.table("data/human.txt")
dim(human2)
## [1] 155 8
str(human2)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
#they are the same
Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-3 points)
library(GGally)
library(ggplot2)
p <- ggpairs(human, mapping = aes(), lower = list(combo=wrap("facethist", bins=20)))+theme_bw()
p
Correlation and distribution can also be visualized separately:
# Correlation
library(corrplot)
cor_mat <- cor(human)
corrplot.mixed(cor_mat)
There is a significant positive correlation of Edu2.FM with Edu.Exp, Life.Exp and GNI, and a significant negative correlation of Edu2.FM with Mat.Mor and Ado.Birth. There is also a positive correlation of Labo.FM with Mat.Mor and Parli.F. A significant positive correlation is seen between Edu.Exp and Life.Exp, GNI and Parli.F, while Edu.Exp correlates significantly negatively with Mat.Mor and Ado.Birth. GNI shows a significant negative correlation with Mat.Mor and Ado.Birth, and Mat.Mor is significantly positively correlated with Ado.Birth.
# Distribution
library(reshape)
ggplot(data = melt(human), aes(x = value)) +
stat_density() +
facet_wrap(~variable, scales = "free")
## Using as id variables
Edu.Exp is normally distributed. Edu2.FM is slightly skewed left but more or less normal. Labo.FM and Life.Exp are skewed left. GNI, Mat.Mor and Ado.Birth are skewed right, and Parli.F is more or less skewed right.
Here is the summary of the data:
summary(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
Perform principal component analysis (PCA) on the raw (non-standardized) human data. Show the variability captured by the principal components. Draw a biplot displaying the observations by the first two principal components (PC1 coordinate in x-axis, PC2 coordinate in y-axis), along with arrows representing the original variables. (0-2 points)
pca_human <- prcomp(human)
biplot(pca_human, choices = 1:2, cex=c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
Standardize the variables in the human data and repeat the above
analysis. Interpret the results of both analysis (with and without
standardizing). Are the results different? Why or why not? Include
captions (brief descriptions) in your plots where you describe the
results by using not just your variable names, but the actual phenomena
they relate to. (0-4 points)
# standardize the variables
human_std <- scale(human)
# print out summaries of the standardized variables
summary(human_std)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7378 Min. :-2.7188
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6782 1st Qu.:-0.6425
## Median : 0.3503 Median : 0.2316 Median : 0.1140 Median : 0.3056
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.7126 3rd Qu.: 0.6717
## Max. : 2.6646 Max. : 1.6632 Max. : 2.4730 Max. : 1.4218
## GNI Mat.Mor Ado.Birth Parli.F
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
# perform principal component analysis (with the SVD method)
# draw a biplot of the principal component representation and the original variables
pca_human <- prcomp(human_std)
biplot(pca_human, choices = 1:2, cex=c(0.8, 1), col = c("grey40", "deeppink2"))
We know that by scaling we subtract the column means from the corresponding columns and divide the differences by the standard deviations. This brings the mean of each variable to zero, and the difference is visible when we compare the two plots. Without scaling, the very large variance of GNI dominates the analysis, so the arrows of the other variables shrink to (near) zero length: in the unscaled plot we get the warning "Warning: zero-length arrow is of indeterminate angle and so skipped" and see only one arrow, for GNI.
In the loading plot, the orientation (direction) of a vector with respect to the principal component axes, in particular its angle with them, defines its contribution to the principal components: the more parallel a vector is to a principal component axis, the more it contributes to that PC alone. In the unscaled case, where we see only GNI, the GNI vector is parallel to the PC1 axis, so GNI has a strong influence on PC1.
For the scaled data, since 0 is at the middle for both PCs, all variables are labelled in the plot. Parli.F and Labo.FM strongly influence PC2, while Edu.Exp, GNI, Edu2.FM, Life.Exp, Mat.Mor and Ado.Birth have a strong say in PC1. The result for GNI is unchanged between the scaled and unscaled data. Scaling also seems to affect how the clusters of countries look based on these variables. In the loading plot, the angles between the vectors of different variables show their correlation in this space: small angles represent high positive correlation, right angles represent a lack of correlation, and opposite angles represent high negative correlation. In the unscaled plot, with only one arrow, we cannot interpret this, but in the scaled plot we can: high positive correlation is observed between Labo.FM and Parli.F and between Mat.Mor and Ado.Birth, as well as among Edu.Exp, GNI, Life.Exp and Edu2.FM.
Give your personal interpretations of the first two principal component dimensions based on the biplot drawn after PCA on the standardized human data. (0-2 points)
The loading plot in the biplot above can also be interpreted by vector length: the longer a vector, the more of that variable's variability is represented by the two displayed principal components; short vectors are better represented in other dimensions. Based on the vector lengths in the scaled plot, PC1 and PC2 seem to mostly represent variability in expected years of education, life expectancy, maternal mortality ratio and perhaps also the ratio of female to male populations with secondary education.
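The exercise also asks to show the variability captured by the principal components; summary(pca_human) would print this directly. A self-contained sketch of the same idea on the built-in USArrests data (used here only because the human data requires the downloaded file):

```r
# Variance captured per PC = squared sdev divided by the total variance;
# summary(pca) reports the same numbers in its "Proportion of Variance" row.
pca <- prcomp(USArrests, scale. = TRUE)      # standardize, then PCA
var_share <- pca$sdev^2 / sum(pca$sdev^2)    # proportion of total variance per PC
round(100 * var_share, 1)                    # percentages; PC1 captures the largest share
```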
The tea data comes from the FactoMineR package and was collected with a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions) and how they perceive the product (12 questions). In addition, some personal details were asked (4 questions). Load the tea dataset and convert its character variables to factors. Explore the data briefly: look at the structure and the dimensions of the data. Use View(tea) to browse its contents, and visualize the data.
library(dplyr)
library(tidyr)
library(ggplot2)
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
## $ frequency : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
#View(tea) # commented out because it opens a new tab to view the data every time I run the code, which is a bit annoying
# visualize the dataset
library(dplyr)
library(tidyr)
# including only some column names to keep in the dataset
#tea_time <- tea %>% select(Tea, How, how, sugar, where, lunch) # for some reason index.Rmd gives an error with this code
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
# all_of() makes the use of an external vector in select() explicit
tea_time <- dplyr::select(tea, all_of(keep_columns))
# look at the summaries and structure of the data
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
# visualize the dataset
library(ggplot2)
pivot_longer(tea_time, cols = everything()) %>%
  ggplot(aes(value)) +
  facet_wrap("name", scales = "free") +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
Most people seem to prefer tea bags. Most also do not drink tea at lunch time, and prefer it plain, without lemon, milk or anything else. The numbers of people preferring sugar and no sugar are almost equal. Earl Grey bought at a chain store seems to be the most common choice.
Use Multiple Correspondence Analysis (MCA) on the tea data (or on just certain columns of the data, it is up to you!). Interpret the results of the MCA and draw at least the variable biplot of the analysis. You can also explore other plotting options for MCA. Comment on the output of the plots. (0-4 points)
In the final plot, the different categories belonging to the same column of the data share the same color.
library(FactoMineR)
#res <- MCA(tea, quanti.sup=19, quali.sup=c(20:36), graph = F)
This code takes a very long time even when the graphs are not plotted, so I will also run the same analysis on the smaller dataset “tea_time” created above with fewer columns.
library(FactoMineR)
# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.141 0.117 0.087 0.062
## % of var. 7.705 6.392 4.724 3.385
## Cumulative % of var. 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459 0.161
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968 0.478
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898 0.141
## v.test Dim.3 ctr cos2 v.test
## black 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 2.867 | 0.433 9.160 0.338 10.053 |
## green -5.669 | -0.108 0.098 0.001 -0.659 |
## alone -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 3.226 | 1.329 14.771 0.218 8.081 |
## milk 2.422 | 0.013 0.003 0.000 0.116 |
## other 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
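The eta2 values above can be reproduced by hand (a sketch): the eta2 of a variable on a dimension is the R squared of a one-way ANOVA of the individuals' coordinates on that variable.

```r
# regress the individuals' Dim.1 coordinates on the `how` factor;
# the R squared should be close to the 0.708 reported for `how` on Dim.1
summary(lm(mca$ind$coord[, 1] ~ tea_time$how))$r.squared
```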
# visualize MCA; I have plotted the variable biplot as well as the individual biplot
plot.MCA(mca, invisible=c("var","quali.sup"), cex=0.7)
plot(mca, invisible=c("ind","quali.sup"), cex=0.7)
plot(mca, invisible=c("ind"))
The scatterplots are quite homogeneous, with some extreme points such as “unpackaged” and “tea shop”.
# visualize MCA; I have plotted the variable biplot as well as the individual biplot
plot(mca, graph.type = "classic")
plot(mca, invisible=c("var"), graph.type = "classic")
plot(mca, invisible=c("ind"), graph.type = "classic")
plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali")
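Another plotting option worth exploring (a sketch, assuming the factoextra package is installed; it is not used elsewhere in this report):

```r
library(factoextra)
# category map with points colored by their contribution to the dimensions;
# repel = TRUE avoids overlapping labels
fviz_mca_var(mca, col.var = "contrib", repel = TRUE)
```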
dimdesc(mca)
## $`Dim 1`
##
## Link between the variable and the categorical variable (1-way anova)
## =============================================
## R2 p.value
## how 0.70769782 4.743070e-80
## where 0.70169065 9.728145e-79
## Tea 0.12597089 2.072984e-09
## sugar 0.06510460 7.631229e-06
## How 0.07575531 3.401438e-05
##
## Link between the variable and the categories of the categorical variables
## ================================================================
## Estimate p.value
## how=unpackaged 0.7352354 9.296592e-50
## where=tea shop 0.7755232 1.016669e-49
## Tea=black 0.1275555 1.994908e-06
## sugar=No.sugar 0.1349392 7.631229e-06
## How=lemon 0.2740418 3.863682e-05
## Tea=green 0.1343915 3.013495e-03
## How=milk -0.2576104 2.550801e-03
## how=tea bag+unpackaged -0.1144382 3.529149e-05
## sugar=sugar -0.1349392 7.631229e-06
## where=chain store+tea shop -0.1273957 1.660721e-06
## Tea=Earl Grey -0.2619470 2.472466e-10
## how=tea bag -0.6207973 1.180088e-44
## where=chain store -0.6481275 1.340115e-45
##
## $`Dim 2`
##
## Link between the variable and the categorical variable (1-way anova)
## =============================================
## R2 p.value
## where 0.68148468 1.640037e-74
## how 0.52173497 2.696444e-48
## How 0.19031574 1.641412e-13
## Tea 0.10764680 4.515572e-08
## lunch 0.06362761 9.756757e-06
##
## Link between the variable and the categories of the categorical variables
## ================================================================
## Estimate p.value
## where=chain store+tea shop 0.71592830 1.221027e-64
## how=tea bag+unpackaged 0.58116824 5.608738e-44
## How=other 0.62818710 1.424535e-08
## lunch=lunch 0.18210707 9.756757e-06
## Tea=Earl Grey 0.18485576 3.974787e-03
## How=milk -0.16271132 1.519564e-02
## How=lemon -0.03039274 1.168288e-03
## lunch=Not.lunch -0.18210707 9.756757e-06
## Tea=green -0.35458225 6.009202e-09
## How=alone -0.43508305 2.039306e-10
## how=unpackaged -0.46020270 1.920895e-11
## where=tea shop -0.57260797 8.006916e-13
## how=tea bag -0.12096554 4.851713e-13
## where=chain store -0.14332033 5.610381e-18
##
## $`Dim 3`
##
## Link between the variable and the categorical variable (1-way anova)
## =============================================
## R2 p.value
## Tea 0.40989341 9.619982e-35
## How 0.39383806 5.776849e-32
## sugar 0.33619389 2.412708e-28
## lunch 0.11110184 3.233374e-09
## where 0.05454983 2.411784e-04
##
## Link between the variable and the categories of the categorical variables
## ================================================================
## Estimate p.value
## Tea=Earl Grey 0.32078574 1.605342e-28
## sugar=sugar 0.27170087 2.412708e-28
## How=lemon 0.77415042 1.104740e-17
## lunch=lunch 0.22062782 3.233374e-09
## where=tea shop 0.22015318 1.043046e-04
## How=alone 0.09880062 7.713644e-03
## where=chain store -0.14933787 4.568451e-03
## lunch=Not.lunch -0.22062782 3.233374e-09
## How=other -1.03063448 6.429911e-16
## sugar=No.sugar -0.27170087 2.412708e-28
## Tea=black -0.38804124 4.898283e-33
I found it quite impossible to say anything based on the first graphs. Most data points lie around the center, and the plots seem pretty homogeneous with some extreme points. So I also use the dimdesc function, which helps with the interpretation of MCA.
The first principal component is characterized by the variables “how” and “where”, followed by “Tea”, “sugar” and “How”. Characterization by categories is similar to characterization by variables, but it allows more precision. For example, in Dim 1 the coordinate of the category “How=lemon” is positive whereas that of “How=milk” is negative. This means that individuals with a positive coordinate tend to take their tea with lemon. Another example in Dim 1 is sugar (negative coordinate) versus no sugar (positive coordinate): individuals with a positive coordinate tend to take their tea without sugar.
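The sign interpretation can also be read directly from the model object (a sketch; the row names are the category labels from the data):

```r
# category coordinates on dimension 1: lemon and No.sugar on the positive
# side, milk and sugar on the negative side
round(mca$var$coord[c("lemon", "milk", "No.sugar", "sugar"), "Dim 1"], 3)
```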
(more chapters to be added similarly as we proceed with the course!)